228 research outputs found
01 Text Processing 1 - Data Mining - Ingegneria e Scienze Informatiche, Cesena
structured, semi-structured, and unstructured data; information retrieval and text mining; document representation; Boolean retrieval models; the document indexing process; tokenization, normalization, lemmatization; stemming algorithms; searching with indexes; other search optimizations
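The indexing steps listed above (tokenization, normalization, stemming, then Boolean search over an inverted index) can be sketched as follows. This is a minimal illustration, not course material: the tokenizer and the suffix-stripping "stemmer" are deliberate toy simplifications of real components such as the Porter stemmer.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Crude word tokenizer: split on non-letter characters.
    return re.findall(r"[a-zA-Z]+", text)

def normalize(token):
    # Case folding; real pipelines also handle accents, numbers, etc.
    return token.lower()

def stem(token):
    # Toy suffix stripping, a stand-in for a real stemming algorithm.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_inverted_index(docs):
    # Map each processed term to the set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[stem(normalize(tok))].add(doc_id)
    return index

def boolean_and(index, *terms):
    # Boolean AND retrieval: intersect the posting sets of the query terms.
    postings = [index.get(stem(normalize(t)), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {1: "Indexing documents", 2: "Document retrieval models", 3: "Stemming algorithms"}
index = build_inverted_index(docs)
```

Because both documents and queries pass through the same normalization and stemming, `boolean_and(index, "documents", "retrieval")` matches document 2 even though the surface forms differ.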
02 Dimensionality Reduction and LSA - Data Mining - Ingegneria e Scienze Informatiche, Cesena
Feature Selection with Mutual Information and the Chi-Square Test, Latent Semantic Analysis (LSA)
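As a small illustration of chi-square feature selection, the statistic for a single term/class pair can be computed from a 2x2 contingency table of document counts; terms are then ranked by this score and the top-k kept. This is a generic sketch of the technique, not code from the course.

```python
def chi_square_2x2(n11, n10, n01, n00):
    # Chi-square statistic for a 2x2 term/class contingency table:
    # n11 = docs in the class containing the term,
    # n10 = docs outside the class containing the term,
    # n01, n00 = the same counts for docs without the term.
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0
```

A term perfectly correlated with the class gets the maximum score (equal to the total count), while a term distributed independently of the class scores zero.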
05 Exercise on Text Classification in WEKA - Data Mining - Ingegneria e Scienze Informatiche, Cesena
Discriminative Marginalized Probabilistic Neural Method for Multi-Document Summarization of Medical Literature
Although current state-of-the-art Transformer-based solutions have succeeded in a wide range of single-document NLP tasks, they still struggle to address multi-input tasks such as multi-document summarization. Many solutions truncate the inputs, thus ignoring potentially summary-relevant content, which is unacceptable in the medical domain, where every piece of information can be vital. Others leverage linear model approximations to apply multi-input concatenation, worsening the results because all information is considered, even if it is conflicting or noisy with respect to a shared background. Despite the importance and social impact of medicine, there are no ad-hoc solutions for multi-document summarization. For this reason, we propose a novel discriminative marginalized probabilistic method (DAMEN), trained to discriminate critical information from a cluster of topic-related medical documents and generate a multi-document summary via token probability marginalization. Results prove we outperform the previous state of the art on a biomedical dataset for multi-document summarization of systematic literature reviews. Moreover, we perform extensive ablation studies to motivate the design choices and prove the importance of each module of our method.
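The core idea of token probability marginalization can be sketched generically: the per-document token distributions are combined into a single next-token distribution, weighted by how relevant a discriminator judges each document to be. The function and weight names below are illustrative assumptions, not the paper's actual implementation.

```python
def marginalize_token_probs(per_doc_probs, doc_weights):
    # per_doc_probs: one dict per document, token -> p(token | document d)
    # doc_weights:   p(d | cluster), e.g. discriminator scores summing to 1
    # Marginal: p(token) = sum over d of p(d | cluster) * p(token | d)
    vocab = set().union(*per_doc_probs)
    marginal = {}
    for tok in vocab:
        marginal[tok] = sum(w * probs.get(tok, 0.0)
                            for w, probs in zip(doc_weights, per_doc_probs))
    return marginal

# Two documents disagreeing on the next token, weighted equally.
probs = marginalize_token_probs(
    [{"a": 0.7, "b": 0.3}, {"a": 0.2, "b": 0.8}],
    [0.5, 0.5],
)
```

Skewing the weights toward documents the discriminator deems critical shifts the marginal distribution toward their content, which is the mechanism the abstract describes for suppressing noisy or conflicting inputs.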
Text-to-Text Extraction and Verbalization of Biomedical Event Graphs
Biomedical events represent complex, graphical, and semantically rich interactions expressed in the scientific literature. Almost all contributions in the event realm orbit around semantic parsing, usually employing discriminative architectures and cumbersome multi-step pipelines limited to a small number of target interaction types. We present the first lightweight framework to solve both event extraction and event verbalization with a unified text-to-text approach, allowing us to fuse all the resources designed so far for different tasks. To this end, we present a new event graph linearization technique and release highly comprehensive event-text paired datasets, covering more than 150 event types from multiple biology subareas (English language). By streamlining parsing and generation into translations, we propose baseline transformer model results according to multiple biomedical text mining benchmarks and NLG metrics. Our extractive models achieve new state-of-the-art performance, surpassing single-task competitors, and show promising capabilities for the controlled generation of coherent natural-language utterances from structured data.
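Event graph linearization, in general, turns a nested trigger/argument graph into a flat token sequence a text-to-text model can read and emit. The bracketed format, role names, and example graph below are hypothetical illustrations, not the linearization scheme the paper actually proposes.

```python
def linearize_event(event, graph):
    # Serialize an event node and its arguments into a flat, bracketed
    # token string; nested events are linearized recursively.
    trigger, event_type, args = graph[event]
    parts = [f"[{event_type}", f"trigger={trigger}"]
    for role, target in args:
        if target in graph:                      # argument is a nested event
            parts.append(f"{role}=({linearize_event(target, graph)})")
        else:                                    # argument is an entity mention
            parts.append(f"{role}={target}")
    return " ".join(parts) + "]"

# Hypothetical graph: a phosphorylation event under negative regulation.
graph = {
    "E1": ("inhibits", "Negative_regulation", [("Theme", "E2")]),
    "E2": ("phosphorylation", "Phosphorylation", [("Theme", "STAT3")]),
}
s = linearize_event("E1", graph)
```

Because the mapping is invertible (brackets delimit events, `role=` pairs delimit arguments), the same format can serve both directions: parsing text to a linearized graph and verbalizing a graph back into text.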
Personalized Web Search via Query Expansion based on User’s Local Hierarchically-Organized Files
Users of Web search engines generally express information needs with short and ambiguous queries, leading to irrelevant results. Personalized search methods improve users’ experience by automatically reformulating queries before sending them to the search engine, or by rearranging received results, according to the users' specific interests. A user profile is often built from previous queries, clicked results, or the user’s browsing history in general; different topics must be distinguished in order to obtain an accurate profile. It is quite common that a set of user files, locally stored in sub-directories, is organized by the user into a coherent taxonomy corresponding to their own topics of interest, but only a few methods leverage this potentially useful source of knowledge. We propose a novel method where a user profile is built from those files, specifically considering their consistent arrangement in directories. A bag of keywords is extracted for each directory from the text documents within it. We can infer the topic of each query and expand it by adding the corresponding keywords, in order to obtain a more targeted formulation. Experiments are carried out using benchmark data through a repeatable systematic process, in order to evaluate objectively how much our method can improve the relevance of query results when applied upon a third-party search engine.
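The overall flow (per-directory keyword bags, topic inference by keyword overlap, then expansion of the query with the matched directory's keywords) can be sketched as below. The matching rule and function names are simplifying assumptions for illustration, not the paper's exact method.

```python
def build_profile(dir_keywords):
    # dir_keywords: directory name -> bag of keywords extracted from
    # the text documents stored inside that directory.
    return {d: set(kw) for d, kw in dir_keywords.items()}

def expand_query(query, profile, max_terms=3):
    # Infer the query's topic as the directory whose keyword bag overlaps
    # the query terms most, then append that directory's other keywords.
    terms = set(query.lower().split())
    best_dir = max(profile, key=lambda d: len(profile[d] & terms), default=None)
    if best_dir is None or not (profile[best_dir] & terms):
        return query                 # no topic matched: leave the query as-is
    extra = sorted(profile[best_dir] - terms)[:max_terms]
    return query + " " + " ".join(extra)

# Toy profile: the ambiguous term "jaguar" is disambiguated by "speed".
profile = build_profile({
    "astronomy": ["telescope", "jaguar", "orbit"],
    "cars": ["jaguar", "engine", "speed"],
})
q = expand_query("jaguar speed", profile)
```

The ambiguous query "jaguar speed" overlaps the `cars` directory on two terms and `astronomy` on one, so it is expanded with car-related keywords rather than astronomy ones, which is the disambiguation effect the abstract describes.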
A Probabilistic Approach to the Drag-Based Model
The forecast of the time of arrival of a coronal mass ejection (CME) to Earth
is of critical importance for our high-technology society and for any future
manned exploration of the Solar System. As critical as the forecast accuracy is
the knowledge of its precision, i.e. the error associated to the estimate. We
propose a statistical approach for the computation of the time of arrival using
the drag-based model by introducing the probability distributions, rather than
exact values, as input parameters, thus allowing the evaluation of the
uncertainty on the forecast. We test this approach using a set of CMEs whose
transit times are known, and obtain extremely promising results: the average
value of the absolute differences between measure and forecast is 9.1h, and
half of these residuals are within the estimated errors. These results suggest
that this approach deserves further investigation. We are working to realize a
real-time implementation which ingests the outputs of automated CME tracking
algorithms as inputs to create a database of events useful for a further
validation of the approach.Comment: 18 pages, 4 figure
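The "probability distributions as inputs" idea can be sketched as a Monte Carlo wrapper around the analytic drag-based model: sample the initial speed, solar wind speed, and drag parameter from distributions, solve each sample for the 1 AU arrival time, and read off the spread of the resulting transit times. All parameter distributions below are illustrative guesses, not the paper's fitted values, and only the v0 > w branch of the model is handled.

```python
import math
import random

AU = 1.496e8            # km
R0 = 20 * 6.96e5        # assumed starting distance: 20 solar radii, in km

def dbm_distance(t, v0, w, gamma):
    # Analytic drag-based model solution for v0 > w (t in seconds):
    # r(t) = r0 + w*t + ln(1 + gamma*(v0 - w)*t) / gamma
    return R0 + w * t + math.log(1.0 + gamma * (v0 - w) * t) / gamma

def arrival_time(v0, w, gamma, t_max=3.0e6):
    # Bisection on r(t) = 1 AU; r(t) is monotonically increasing in t.
    lo, hi = 0.0, t_max
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if dbm_distance(mid, v0, w, gamma) < AU:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def monte_carlo_toa(n=2000, seed=0):
    # Draw the model inputs from probability distributions instead of
    # point values, yielding a distribution of transit times (in hours).
    rng = random.Random(seed)
    times = []
    for _ in range(n):
        v0 = rng.gauss(1000.0, 100.0)                       # CME speed, km/s
        w = rng.gauss(400.0, 50.0)                          # wind speed, km/s
        gamma = max(abs(rng.gauss(2e-7, 5e-8)), 1e-9)       # drag, 1/km
        if v0 <= w:
            continue                    # formula branch assumes v0 > w
        times.append(arrival_time(v0, w, gamma) / 3600.0)
    mean = sum(times) / len(times)
    spread = (sum((t - mean) ** 2 for t in times) / len(times)) ** 0.5
    return mean, spread
```

The standard deviation of the sampled arrival times is exactly the forecast-precision estimate the abstract argues for: an error bar attached to the predicted time of arrival rather than a bare point value.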
Analisi e gestione informatica di sequenze trascritte in organismi non-modello
2011/2012. The main theme of this thesis work is the discussion of the methods that, through the use of purpose-built tools and third-party software, made it possible to analyze transcribed sequences from 5 non-model organisms: Mytilus galloprovincialis, Ruditapes philippinarum, Latimeria menadoensis, Astacus leptodactylus and Procambarus clarkii. (XXV Ciclo; 198 pages)
Learning to Predict the Stock Market Dow Jones Index Detecting and Mining Relevant Tweets
Stock market analysis is a primary interest in finance and a challenging task that has always attracted many researchers. Historically, this task was accomplished by means of trend analysis, but in recent years text mining has emerged as a promising way to predict stock price movements. Indeed, previous works showed not only a strong correlation between financial news and their impact on the movements of stock prices, but also that the analysis of social network posts can help to predict them. These latest methods are mainly based on complex techniques to extract the semantic content and/or the sentiment of the social network posts. Differently, in this paper we describe a method to predict the Dow Jones Industrial Average (DJIA) price movements based on simpler mining techniques and text similarity measures, in order to detect and characterise relevant tweets that lead to increments and decrements of the DJIA. Considering the high level of noise in social network data, we also introduce a noise detection method based on a two-step classification. We tested our method on 10 million Twitter posts spanning one year, achieving an accuracy of 88.9% in the daily Dow Jones prediction, which is, to the best of our knowledge, the best result among literature approaches based on social networks.
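A generic sketch of the two ingredients named above, text similarity and a noise-filtering step before classification, follows. The bag-of-words representation, cosine measure, centroid construction, and threshold are standard-technique assumptions for illustration, not the paper's actual pipeline or data.

```python
import math
from collections import Counter

def bow(text):
    # Bag-of-words representation of a tweet.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify_tweet(tweet, up_centroid, down_centroid, noise_threshold=0.1):
    # Two-step classification: first discard noisy tweets that resemble
    # neither class, then assign the closer class.
    v = bow(tweet)
    s_up, s_down = cosine(v, up_centroid), cosine(v, down_centroid)
    if max(s_up, s_down) < noise_threshold:
        return "noise"
    return "up" if s_up >= s_down else "down"

# Toy centroids built from tweets seen on days the index rose or fell.
up_centroid = bow("stocks rally market gains record high")
down_centroid = bow("stocks fall market losses selloff fears")

label = classify_tweet("market gains push stocks to record high",
                       up_centroid, down_centroid)
```

The noise step matters because most tweets are about neither class: filtering them out first keeps unrelated chatter from being forced into an "up" or "down" vote.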